Homework 3

Author: Dawid Pludowski

Data preparation

Most of the data preprocessing was done for the previous homework assignments; only Python-side preprocessing, such as handling categorical variables, is required here.
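As an illustration of what handling categorical variables can look like, here is a minimal sketch using one-hot encoding in pandas. The column name `ocean_proximity` and the sample values are assumptions made for the example; the notebook's actual columns and encoding method may differ.

```python
import pandas as pd

# Toy frame; "ocean_proximity" is a hypothetical categorical column.
df = pd.DataFrame({
    "median_income": [8.3, 3.1, 5.6],
    "ocean_proximity": ["NEAR BAY", "INLAND", "NEAR BAY"],
})

# Replace the categorical column with one 0/1 indicator column per category.
encoded = pd.get_dummies(df, columns=["ocean_proximity"], dtype=int)
print(encoded)
```

One-hot encoding keeps the categories unordered, which matters for the neural network; tree-based models would also accept an integer encoding.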

Creating models

Decision tree

Random forest

Neural network

Choosing best model

The decision tree achieves the best score, so we treat it as the base model in the rest of the notebook.
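A comparison of this kind could be sketched as follows with scikit-learn. Everything here is an assumption for illustration: the data is synthetic, the hyperparameters are arbitrary, and the metric is R^2 on a held-out split, which may not be the scoring used in the notebook.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPRegressor
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.tree import DecisionTreeRegressor

# Synthetic stand-in data (assumption); the notebook uses its own dataset.
X, y = make_regression(n_samples=300, n_features=5, noise=10.0, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Arbitrary hyperparameters, chosen only to keep the sketch fast.
models = {
    "decision_tree": DecisionTreeRegressor(max_depth=5, random_state=0),
    "random_forest": RandomForestRegressor(n_estimators=50, random_state=0),
    "neural_network": make_pipeline(
        StandardScaler(),
        MLPRegressor(hidden_layer_sizes=(32,), max_iter=500, random_state=0),
    ),
}

scores = {}
for name, model in models.items():
    model.fit(X_train, y_train)
    scores[name] = model.score(X_test, y_test)  # R^2 on the held-out split

# The model with the highest held-out score becomes the base model.
best_name = max(scores, key=scores.get)
print(scores, best_name)
```

On synthetic linear data the ranking will generally differ from the one reported above; the sketch only shows the selection procedure.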

Homework p.1

The predicted value is close to the real one.

Homework p.2

Despite having the highest score, the decision tree bases its decisions on only 2-3 variables out of 12 (the remaining plots are not shown for the sake of notebook clarity). This may suggest that this kind of model cannot use all the information hidden in the data, and thus other models should be considered.

The variable with the greatest impact on the prediction is median_income. It follows the rule that the richer the inhabitants are, the more expensive the neighbourhood is, which is reasonable.
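The idea behind a CP (ceteris paribus) profile can be sketched in a few lines of NumPy: one variable of a single observation is varied over a grid while the others stay fixed, and the model is evaluated at each grid point. The helper `cp_profile` and the model `toy_predict` below are hypothetical stand-ins, not the notebook's code or any library's API.

```python
import numpy as np

def cp_profile(predict, observation, feature_index, grid):
    """Vary one feature of a single observation over `grid`, others fixed."""
    rows = np.tile(np.asarray(observation, dtype=float), (len(grid), 1))
    rows[:, feature_index] = grid
    return predict(rows)

def toy_predict(X):
    # Hypothetical model: uses only feature 0 and ignores feature 1.
    return 50.0 * X[:, 0]

observation = [3.0, 1.0]
grid = np.linspace(1.0, 5.0, 5)

used_profile = cp_profile(toy_predict, observation, 0, grid)     # changes with the grid
ignored_profile = cp_profile(toy_predict, observation, 1, grid)  # flat line
print(used_profile, ignored_profile)
```

A flat profile, like `ignored_profile` here, is what indicates that the model does not use a variable for the given observation.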

Homework p.3

The comparison of the models for this particular observation shows that the neural network might be more sensitive to changes in the households variable. Moreover, the full CP plots (not shown in the notebook) suggest that a change in any variable has an impact on the network's decision, whilst this is not true for the random forest and the decision tree. Further analysis should be performed to check whether the changes in the neural network's CP profiles are reasonable; if so, the neural network should be considered the best model for estimating the median price, as its score is only slightly lower than the decision tree's and its predictions are more nuanced.

Conclusion

CP plots show that the best model in terms of scoring may not take all the information into account and may therefore be a poor explainer of the real world. However, one should remember that the dataset contains interactions (such as longitude and latitude) and correlations (the ratio variables), and because of that CP plots alone are not a reliable method for explaining model performance.
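The caveat about interactions can be illustrated with a small sketch. It assumes a hypothetical model that is a pure product of two features; the same variable then gets opposite CP profiles for two observations that differ only in the other feature, so a single profile cannot summarise the variable's effect.

```python
import numpy as np

def interaction_model(X):
    # Hypothetical model with a pure interaction between two features.
    return X[:, 0] * X[:, 1]

def cp_profile(predict, observation, feature_index, grid):
    """Vary one feature of a single observation over `grid`, others fixed."""
    rows = np.tile(np.asarray(observation, dtype=float), (len(grid), 1))
    rows[:, feature_index] = grid
    return predict(rows)

grid = np.array([0.0, 1.0, 2.0])

# Same varied feature, but the fixed second feature differs in sign.
profile_a = cp_profile(interaction_model, [0.0, 1.0], 0, grid)   # increasing
profile_b = cp_profile(interaction_model, [0.0, -1.0], 0, grid)  # decreasing
print(profile_a, profile_b)
```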